GROUP PROJECT

HOUSE SALES PRICE PREDICTION AND EDA

PROBLEM STATEMENT

Buying a new house is a difficult task. House prices vary because of many factors, such as the number of bedrooms, the size of the basement, and many more. We have therefore made it easier for buyers to search for their dream house by developing a program that predicts the house price from various attributes of the house. We consider 24 attributes (the dataset is described during execution) to predict one variable, SalePrice. This program helps buyers understand whether a particular house is within their budget. Brokers sometimes inflate the price of a house just to earn extra commission, and buyers will be able to tell whether the broker is quoting the actual price. The program can also help real estate brokers sell a property at the right price.

GOAL

It is our job to predict the sale price for each house. For each Id in the test set, we must predict the value of the SalePrice variable. We will also identify the attributes that most strongly affect SalePrice.

AUDIENCE

This program will help:

1. House buyers
2. Real estate brokers

Let us begin with understanding data.

Here we import all libraries.

In [1]:
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt # plotting
import numpy as np # linear algebra
import os # accessing directory structure
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns 

Let us read the dataset.

In [2]:
data=pd.read_csv("C:\\Users\\Admin\\Desktop\\Main_data.csv")

Following steps will help us understand the profile of the dataset.

In [3]:
import pandas as pd
import pandas_profiling
pandas_profiling.ProfileReport(data)
Out[3]:

In [4]:
data.head()
Out[4]:
Id MSSubClass MSZoning LotFrontage LotArea Street LotShape Utilities OverallQual OverallCond ... LowQualFinSF BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr GarageArea Fence SalePrice
0 1461 20 RH 80.0 11622 Pave Reg AllPub 5 6 ... 0 0.0 0.0 1 0 2 1 730.0 MnPrv 169277.0525
1 1462 20 RL 81.0 14267 Pave IR1 AllPub 6 6 ... 0 0.0 0.0 1 1 3 1 312.0 NaN 187758.3940
2 1463 60 RL 74.0 13830 Pave IR1 AllPub 5 5 ... 0 0.0 0.0 2 1 3 1 482.0 MnPrv 183583.6836
3 1464 60 RL 78.0 9978 Pave IR1 AllPub 6 6 ... 0 0.0 0.0 2 1 3 1 470.0 NaN 179317.4775
4 1465 120 RL 43.0 5005 Pave IR1 AllPub 8 5 ... 0 0.0 0.0 2 0 2 1 506.0 NaN 150730.0800

5 rows × 24 columns

Data Cleaning

Data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table or database. It involves identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying or deleting the dirty or coarse data.

Let us see how many values are null. False indicates that it is not a null value. True indicates it is a null value.

In [5]:
data.isnull()
Out[5]:
Id MSSubClass MSZoning LotFrontage LotArea Street LotShape Utilities OverallQual OverallCond ... LowQualFinSF BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr GarageArea Fence SalePrice
0 False False False False False False False False False False ... False False False False False False False False False False
1 False False False False False False False False False False ... False False False False False False False False True False
2 False False False False False False False False False False ... False False False False False False False False False False
3 False False False False False False False False False False ... False False False False False False False False True False
4 False False False False False False False False False False ... False False False False False False False False True False
5 False False False False False False False False False False ... False False False False False False False False True False
6 False False False True False False False False False False ... False False False False False False False False False False
7 False False False False False False False False False False ... False False False False False False False False True False
8 False False False False False False False False False False ... False False False False False False False False True False
9 False False False False False False False False False False ... False False False False False False False False False False
10 False False False False False False False False False False ... False False False False False False False False True False
11 False False False False False False False False False False ... False False False False False False False False True False
12 False False False False False False False False False False ... False False False False False False False False True False
13 False False False False False False False False False False ... False False False False False False False False True False
14 False False False False False False False False False False ... False False False False False False False False True False
15 False False False False False False False False False False ... False False False False False False False False True False
16 False False False False False False False False False False ... False False False False False False False False True False
17 False False False False False False False False False False ... False False False False False False False False True False
18 False False False False False False False False False False ... False False False False False False False False True False
19 False False False False False False False False False False ... False False False False False False False False True False
20 False False False False False False False False False False ... False False False False False False False False True False
21 False False False False False False False False False False ... False False False False False False False False True False
22 False False False False False False False False False False ... False False False False False False False False True False
23 False False False False False False False False False False ... False False False False False False False False True False
24 False False False False False False False False False False ... False False False False False False False False True False
25 False False False False False False False False False False ... False False False False False False False False True False
26 False False False False False False False False False False ... False False False False False False False False True False
27 False False False False False False False False False False ... False False False False False False False False True False
28 False False False False False False False False False False ... False False False False False False False False True False
29 False False False False False False False False False False ... False False False False False False False False True False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1429 False False False False False False False False False False ... False False False False False False False False False False
1430 False False False False False False False False False False ... False False False False False False False False True False
1431 False False False False False False False False False False ... False False False False False False False False True False
1432 False False False False False False False False False False ... False False False False False False False False True False
1433 False False False False False False False False False False ... False False False False False False False False True False
1434 False False False False False False False False False False ... False False False False False False False False True False
1435 False False False False False False False False False False ... False False False False False False False False True False
1436 False False False False False False False False False False ... False False False False False False False False True False
1437 False False False False False False False False False False ... False False False False False False False False True False
1438 False False False False False False False False False False ... False False False False False False False False True False
1439 False False False False False False False False False False ... False False False False False False False False True False
1440 False False False True False False False False False False ... False False False False False False False False True False
1441 False False False True False False False False False False ... False False False False False False False False True False
1442 False False False False False False False False False False ... False False False False False False False False True False
1443 False False False False False False False False False False ... False False False False False False False False True False
1444 False False True False False False False False False False ... False False False False False False False False True False
1445 False False False False False False False False False False ... False False False False False False False False True False
1446 False False False False False False False False False False ... False False False False False False False False True False
1447 False False False False False False False False False False ... False False False False False False False False True False
1448 False False False True False False False False False False ... False False False False False False False False True False
1449 False False False False False False False False False False ... False False False False False False False False True False
1450 False False False False False False False False False False ... False False False False False False False False True False
1451 False False False False False False False False False False ... False False False False False False False False True False
1452 False False False False False False False False False False ... False False False False False False False False True False
1453 False False False False False False False False False False ... False False False False False False False False False False
1454 False False False False False False False False False False ... False False False False False False False False True False
1455 False False False False False False False False False False ... False False False False False False False False True False
1456 False False False False False False False False False False ... False False False False False False False False True False
1457 False False False False False False False False False False ... False False False False False False False False False False
1458 False False False False False False False False False False ... False False False False False False False False True False

1459 rows × 24 columns

From the above step we observe that the columns Fence and LowQualFinSF have many null values. Hence we will not consider those columns while predicting prices. Let us drop the two columns.
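Scanning the full boolean frame for True values is tedious; `isnull().sum()` counts the missing values per column directly. A minimal sketch on a toy frame (the values are hypothetical, since the full CSV lives at a local path):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the house data (hypothetical values).
toy = pd.DataFrame({
    "LotFrontage": [80.0, np.nan, 74.0],
    "Fence": [np.nan, "MnPrv", np.nan],
    "SalePrice": [169277.05, 187758.39, 183583.68],
})

# isnull().sum() gives one missing-value count per column,
# which is much easier to scan than the full True/False table.
null_counts = toy.isnull().sum()
print(null_counts)
```

On the real dataset the same one-liner would show at a glance that Fence has by far the most nulls.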

In [6]:
data= data.drop(columns="Fence")
In [7]:
data= data.drop(columns="LowQualFinSF")
In [8]:
data.head()
Out[8]:
Id MSSubClass MSZoning LotFrontage LotArea Street LotShape Utilities OverallQual OverallCond ... 1stFlrSF 2ndFlrSF BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr GarageArea SalePrice
0 1461 20 RH 80.0 11622 Pave Reg AllPub 5 6 ... 896 0 0.0 0.0 1 0 2 1 730.0 169277.0525
1 1462 20 RL 81.0 14267 Pave IR1 AllPub 6 6 ... 1329 0 0.0 0.0 1 1 3 1 312.0 187758.3940
2 1463 60 RL 74.0 13830 Pave IR1 AllPub 5 5 ... 928 701 0.0 0.0 2 1 3 1 482.0 183583.6836
3 1464 60 RL 78.0 9978 Pave IR1 AllPub 6 6 ... 926 678 0.0 0.0 2 1 3 1 470.0 179317.4775
4 1465 120 RL 43.0 5005 Pave IR1 AllPub 8 5 ... 1280 0 0.0 0.0 2 0 2 1 506.0 150730.0800

5 rows × 22 columns

In the above step we dropped the columns with null values. Let us now drop the rows that contain 'NA' values; rows with 'NA' values are of no use when predicting the SalePrice of a house.

In [9]:
df=data.dropna(how='any',axis=0)
df
Out[9]:
Id MSSubClass MSZoning LotFrontage LotArea Street LotShape Utilities OverallQual OverallCond ... 1stFlrSF 2ndFlrSF BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr GarageArea SalePrice
0 1461 20 RH 80.0 11622 Pave Reg AllPub 5 6 ... 896 0 0.0 0.0 1 0 2 1 730.0 169277.0525
1 1462 20 RL 81.0 14267 Pave IR1 AllPub 6 6 ... 1329 0 0.0 0.0 1 1 3 1 312.0 187758.3940
2 1463 60 RL 74.0 13830 Pave IR1 AllPub 5 5 ... 928 701 0.0 0.0 2 1 3 1 482.0 183583.6836
3 1464 60 RL 78.0 9978 Pave IR1 AllPub 6 6 ... 926 678 0.0 0.0 2 1 3 1 470.0 179317.4775
4 1465 120 RL 43.0 5005 Pave IR1 AllPub 8 5 ... 1280 0 0.0 0.0 2 0 2 1 506.0 150730.0800
5 1466 60 RL 75.0 10000 Pave IR1 AllPub 6 5 ... 763 892 0.0 0.0 2 1 3 1 440.0 177150.9892
7 1468 60 RL 63.0 8402 Pave IR1 AllPub 6 5 ... 789 676 0.0 0.0 2 1 3 1 393.0 175110.9565
8 1469 20 RL 85.0 10176 Pave Reg AllPub 7 5 ... 1341 0 1.0 0.0 1 1 2 1 506.0 162011.6988
9 1470 20 RL 70.0 8400 Pave Reg AllPub 4 5 ... 882 0 1.0 0.0 1 0 2 1 525.0 160726.2478
10 1471 120 RH 26.0 5858 Pave IR1 AllPub 7 5 ... 1337 0 1.0 0.0 2 0 2 1 511.0 157933.2795
11 1472 160 RM 21.0 1680 Pave Reg AllPub 6 5 ... 483 504 0.0 0.0 1 1 2 1 264.0 145291.2450
12 1473 160 RM 21.0 1680 Pave Reg AllPub 5 5 ... 525 567 0.0 0.0 1 1 3 1 320.0 159672.0176
13 1474 160 RL 24.0 2280 Pave Reg AllPub 6 6 ... 855 601 0.0 0.0 2 1 3 1 440.0 164167.5183
14 1475 120 RL 24.0 2280 Pave Reg AllPub 7 6 ... 836 0 0.0 0.0 1 0 2 1 308.0 150891.6382
15 1476 60 RL 102.0 12858 Pave IR1 AllPub 9 5 ... 1627 707 0.0 0.0 2 1 3 1 751.0 179460.9652
16 1477 20 RL 94.0 12883 Pave IR1 AllPub 8 5 ... 1544 0 0.0 0.0 2 0 3 1 868.0 185034.6289
17 1478 20 RL 90.0 11520 Pave Reg AllPub 9 5 ... 1698 0 0.0 0.0 2 0 3 1 730.0 182352.1926
18 1479 20 RL 79.0 14122 Pave IR1 AllPub 8 5 ... 1822 0 0.0 0.0 2 0 3 1 678.0 183053.4582
19 1480 20 RL 110.0 14300 Pave Reg AllPub 9 5 ... 2696 0 1.0 0.0 2 1 3 1 958.0 187823.3393
20 1481 60 RL 105.0 13650 Pave Reg AllPub 8 5 ... 1687 563 1.0 0.0 2 1 3 1 756.0 186544.1143
21 1482 120 RL 41.0 7132 Pave IR1 AllPub 8 5 ... 1370 0 0.0 0.0 2 0 2 1 484.0 158230.7752
22 1483 20 RL 100.0 18494 Pave IR1 AllPub 6 5 ... 1324 0 0.0 0.0 2 0 3 1 430.0 190552.8293
23 1484 120 RL 43.0 3203 Pave Reg AllPub 7 5 ... 1145 0 0.0 0.0 2 0 2 1 437.0 147183.6749
24 1485 80 RL 67.0 13300 Pave IR1 AllPub 7 5 ... 744 630 1.0 0.0 2 1 3 1 400.0 185855.3009
25 1486 60 RL 63.0 8577 Pave IR1 AllPub 7 5 ... 847 886 0.0 0.0 2 1 3 1 433.0 174350.4707
26 1487 60 RL 60.0 17433 Pave IR2 AllPub 8 5 ... 1645 830 0.0 0.0 2 1 4 1 962.0 201740.6207
27 1488 20 RL 73.0 8987 Pave Reg AllPub 8 5 ... 1595 0 0.0 0.0 2 0 2 1 880.0 162986.3789
28 1489 20 FV 92.0 9215 Pave Reg AllPub 7 5 ... 1218 0 0.0 0.0 2 0 2 1 676.0 162330.1991
29 1490 20 FV 84.0 10440 Pave Reg AllPub 6 5 ... 1468 0 1.0 0.0 2 0 2 1 528.0 165845.9386
30 1491 60 RL 70.0 11920 Pave Reg AllPub 7 5 ... 831 828 0.0 0.0 2 1 3 1 484.0 180929.6229
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1425 2886 60 RL 78.0 15600 Pave Reg AllPub 5 7 ... 1278 1037 1.0 0.0 2 0 4 1 342.0 208353.2696
1426 2887 30 RM 50.0 7288 Pave Reg AllPub 5 6 ... 976 0 1.0 0.0 1 0 2 1 215.0 166548.0672
1427 2888 50 RM 50.0 7000 Pave Reg AllPub 6 7 ... 861 424 0.0 1.0 1 0 3 1 506.0 175942.4746
1428 2889 30 RM 61.0 8534 Pave Reg AllPub 4 4 ... 672 0 0.0 0.0 1 0 2 1 0.0 166790.4579
1429 2890 30 RM 50.0 7030 Pave Reg AllPub 4 6 ... 641 0 0.0 0.0 1 0 2 1 272.0 160515.8506
1430 2891 50 RM 75.0 9060 Pave Reg AllPub 6 5 ... 967 671 0.0 0.0 2 0 4 1 384.0 192167.6211
1431 2892 30 C (all) 69.0 12366 Pave Reg AllPub 3 5 ... 729 0 0.0 0.0 1 0 2 1 0.0 178751.5511
1432 2893 190 C (all) 50.0 9000 Pave Reg AllPub 5 6 ... 1060 336 0.0 0.0 2 0 4 2 0.0 198678.8941
1433 2894 50 C (all) 60.0 8520 Grvl Reg AllPub 3 5 ... 576 360 0.0 0.0 1 0 2 1 0.0 164553.1203
1434 2895 120 RM 41.0 5748 Pave IR1 AllPub 8 5 ... 1778 0 2.0 0.0 2 0 2 1 495.0 156887.9329
1435 2896 120 RM 44.0 3842 Pave IR1 AllPub 8 5 ... 1646 0 1.0 1.0 2 0 2 1 525.0 164185.7773
1436 2897 20 RL 69.0 23580 Pave IR1 AllPub 6 6 ... 1625 0 0.0 1.0 2 0 3 1 576.0 212992.1206
1437 2898 90 RL 65.0 8385 Pave Reg AllPub 6 5 ... 1664 0 0.0 0.0 2 0 4 2 616.0 197468.5505
1438 2899 20 RL 70.0 9116 Pave Reg AllPub 8 5 ... 1491 0 0.0 0.0 2 0 3 1 490.0 180106.8437
1439 2900 80 RL 140.0 11080 Pave Reg AllPub 6 6 ... 1210 0 1.0 0.0 2 0 3 1 528.0 183972.0711
1442 2903 20 RL 95.0 13618 Pave Reg AllPub 8 5 ... 1960 0 1.0 0.0 2 0 3 1 714.0 195596.3077
1443 2904 20 RL 88.0 11577 Pave Reg AllPub 9 5 ... 1838 0 1.0 0.0 2 0 3 1 682.0 189369.7563
1445 2906 90 RM 78.0 7020 Pave Reg AllPub 7 5 ... 1368 0 2.0 0.0 2 0 2 2 784.0 169335.3106
1446 2907 160 RM 41.0 2665 Pave Reg AllPub 5 6 ... 616 688 0.0 0.0 1 1 3 1 336.0 167411.0284
1447 2908 20 RL 58.0 10172 Pave IR1 AllPub 5 7 ... 874 0 1.0 0.0 1 0 3 1 288.0 187709.5550
1449 2910 180 RM 21.0 1470 Pave Reg AllPub 4 6 ... 630 0 1.0 0.0 1 0 1 1 0.0 137402.5699
1450 2911 160 RM 21.0 1484 Pave Reg AllPub 4 4 ... 546 546 0.0 0.0 1 1 3 1 253.0 165086.7751
1451 2912 20 RL 80.0 13384 Pave Reg AllPub 5 5 ... 1360 0 1.0 0.0 1 0 3 1 336.0 188506.4314
1452 2913 160 RM 21.0 1533 Pave Reg AllPub 4 5 ... 546 546 0.0 0.0 1 1 3 1 286.0 172917.4568
1453 2914 160 RM 21.0 1526 Pave Reg AllPub 4 5 ... 546 546 0.0 0.0 1 1 3 1 0.0 166274.3252
1454 2915 160 RM 21.0 1936 Pave Reg AllPub 4 7 ... 546 546 0.0 0.0 1 1 3 1 0.0 167081.2209
1455 2916 160 RM 21.0 1894 Pave Reg AllPub 4 5 ... 546 546 0.0 0.0 1 1 3 1 286.0 164788.7782
1456 2917 20 RL 160.0 20000 Pave Reg AllPub 5 7 ... 1224 0 1.0 0.0 1 0 4 1 576.0 219222.4234
1457 2918 85 RL 62.0 10441 Pave Reg AllPub 5 5 ... 970 0 0.0 1.0 1 0 3 1 0.0 184924.2797
1458 2919 60 RL 74.0 9627 Pave Reg AllPub 7 5 ... 996 1004 0.0 0.0 2 1 3 1 650.0 187741.8667

1226 rows × 22 columns
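The `dropna(how='any', axis=0)` call above removes every row that contains at least one missing value, which is why the row count drops from 1459 to 1226. A minimal sketch of the same semantics on toy data:

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values: a row is dropped as soon as
# it contains a single NaN when how='any' is used.
toy = pd.DataFrame({
    "LotFrontage": [80.0, np.nan, 74.0],
    "GarageArea": [730.0, 312.0, np.nan],
})
clean = toy.dropna(how="any", axis=0)
print(clean)
```

Only the first row survives here, since the other two rows each contain one NaN.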

We have finished cleaning the dataset. Let us see the description of our final dataset.

In [10]:
data.describe()
Out[10]:
Id MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt 1stFlrSF 2ndFlrSF BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr GarageArea SalePrice
count 1459.000000 1459.000000 1232.000000 1459.000000 1459.000000 1459.000000 1459.000000 1459.000000 1459.000000 1457.000000 1457.000000 1459.000000 1459.000000 1459.000000 1459.000000 1458.000000 1459.000000
mean 2190.000000 57.378341 68.580357 9819.161069 6.078821 5.553804 1971.357779 1156.534613 325.967786 0.434454 0.065202 1.570939 0.377656 2.854010 1.042495 472.768861 179183.918243
std 421.321334 42.746880 22.376841 4955.517327 1.436812 1.113740 30.390071 398.165820 420.610226 0.530648 0.252468 0.555190 0.503017 0.829788 0.208472 217.048611 16518.303051
min 1461.000000 20.000000 21.000000 1470.000000 1.000000 1.000000 1879.000000 407.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 135751.318900
25% 1825.500000 20.000000 58.000000 7391.000000 5.000000 5.000000 1953.000000 873.500000 0.000000 0.000000 0.000000 1.000000 0.000000 2.000000 1.000000 318.000000 168703.011200
50% 2190.000000 50.000000 67.000000 9399.000000 6.000000 5.000000 1973.000000 1079.000000 0.000000 0.000000 0.000000 2.000000 0.000000 3.000000 1.000000 480.000000 179208.665700
75% 2554.500000 70.000000 80.000000 11517.500000 7.000000 6.000000 2001.000000 1382.500000 676.000000 1.000000 0.000000 2.000000 1.000000 3.000000 1.000000 576.000000 186789.409400
max 2919.000000 190.000000 200.000000 56600.000000 10.000000 9.000000 2010.000000 5095.000000 1862.000000 3.000000 2.000000 4.000000 2.000000 6.000000 2.000000 1488.000000 281643.976100

Data Visualization

Data visualization is the graphic representation of data. It involves producing images that communicate relationships among the represented data to viewers of the images.

Let us begin with visualization by importing libraries.

In [11]:
%matplotlib inline
In [12]:
import numpy as np
In [13]:
from scipy import stats, integrate
import matplotlib.pyplot as plt 
In [14]:
import seaborn as sns
sns.set(color_codes= True)

Univariate Distribution:

Univariate analysis involves the analysis of only one variable. It does not deal with causes; it summarizes the data and finds patterns within it. The most convenient way to examine a univariate distribution in seaborn is distplot(), which draws a histogram and fits a kernel density estimate (KDE).

In [15]:
plt.figure(figsize=(9,8))     #this sets the size of figure
sns.distplot(df['SalePrice'],bins=50)  #with this we plot histogram of SalePrice and give it's kernel density estimate.
print("Skewness: %f" % df['SalePrice'].skew())
print("Kurtosis: %f" % df['SalePrice'].kurt())
Skewness: 0.724931
Kurtosis: 2.650423
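The positive skewness reported above means the SalePrice distribution has a longer right tail. A minimal sketch on hypothetical numbers showing how `.skew()` flags this:

```python
import pandas as pd

# Hypothetical prices with one large value pulling out the right tail.
# A positive .skew() indicates a right-skewed distribution, matching
# the interpretation of the SalePrice skewness (~0.72) reported above.
prices = pd.Series([100.0, 105.0, 110.0, 115.0, 120.0, 300.0])
print(prices.skew(), prices.kurt())
```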
In [16]:
cor_mat= df[:].corr()
cor_with_tar=cor_mat.sort_values(['SalePrice'],ascending=False)
In [17]:
print("The most relevant features (numeric) for the target are :")
cor_with_tar.SalePrice
The most relevant features (numeric) for the target are :
Out[17]:
SalePrice       1.000000
BedroomAbvGr    0.813514
LotArea         0.686926
LotFrontage     0.499447
2ndFlrSF        0.344355
1stFlrSF        0.319452
FullBath        0.314649
GarageArea      0.245892
HalfBath        0.206062
KitchenAbvGr    0.169055
Id              0.138676
OverallQual     0.107261
YearBuilt       0.015349
BsmtHalfBath   -0.030641
BsmtFullBath   -0.052726
OverallCond    -0.075004
MSSubClass     -0.199732
Name: SalePrice, dtype: float64

INFERENCES:

Note that some of the features have quite a high correlation with the target. These features are really significant.

Of these, the features with a correlation value > 0.5 are especially important, and some, like BedroomAbvGr, even more so.

We will examine these features (i.e. BedroomAbvGr, LotArea, etc.) in more detail in the subsequent univariate and bivariate analysis.
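The "> 0.5" rule of thumb can also be applied programmatically. A minimal sketch on toy data (all values hypothetical) that keeps only the features whose absolute correlation with SalePrice exceeds the threshold:

```python
import pandas as pd

# Toy stand-in for the house data with hypothetical values.
toy = pd.DataFrame({
    "BedroomAbvGr": [2, 3, 3, 4, 2, 5],
    "LotArea": [5000, 9000, 9500, 14000, 6000, 16000],
    "Id": [4, 1, 6, 2, 5, 3],          # an arbitrary identifier
    "SalePrice": [150, 180, 185, 210, 155, 230],
})

# Correlation of every numeric feature with the target, dropping the
# target's trivial self-correlation of 1.0, then filtering by |r| > 0.5.
cor = toy.corr()["SalePrice"].drop("SalePrice")
strong = cor[cor.abs() > 0.5]
print(sorted(strong.index))
```

On the real dataset this would single out BedroomAbvGr and LotArea, as noted above, while weakly correlated columns such as Id are filtered out.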

In [18]:
# using a correlation heatmap to visualize features with high correlation.
cor_mat= df[['BedroomAbvGr','LotArea','LotFrontage','SalePrice']].corr()
mask = np.array(cor_mat)
mask[np.tril_indices_from(mask)] = False
fig=plt.gcf()
fig.set_size_inches(30,12)
sns.heatmap(data=cor_mat,mask=mask,square=True,annot=True,cbar=True)
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x15601479860>

In this section the univariate analysis is performed, focusing on the features that matter most for the target, i.e. those with a high correlation with the target.

For numeric features, 'distplot' and 'boxplot' are used to analyze their distributions.

In [19]:
def plot_cat(feature):
  ax = sns.countplot(data=df, x=feature)   # bar chart of category counts
In [20]:
plot_cat('OverallQual')

Bivariate Analysis:

Bivariate analysis finds the relationship between two variables and measures how strong that relationship is. It is a form of quantitative analysis.
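The strength of such a relationship can be quantified with the Pearson correlation coefficient. A minimal sketch using `scipy.stats.pearsonr` on two hypothetical variables:

```python
from scipy import stats

# Two hypothetical variables with a near-linear relationship.
# r close to +1 means a strong positive linear association;
# the p-value tests whether r differs from zero.
garage_area = [200, 300, 400, 500, 600]
sale_price = [140, 155, 168, 182, 195]
r, p = stats.pearsonr(garage_area, sale_price)
print(round(r, 3))
```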

In [21]:
# garage area
fig, ax = plt.subplots()
ax.scatter(x =(df['GarageArea']), y = df['SalePrice'])
plt.ylabel('SalePrice')
plt.xlabel('GarageArea')
plt.show()
# could try removing the points with GarageArea > 1200 (possible outliers).

SalePrice increases as GarageArea increases.

In [22]:
# lot frontage
fig, ax = plt.subplots()
ax.scatter(x =(df['LotFrontage']), y = df['SalePrice'])
plt.ylabel('SalePrice')
plt.xlabel('LotFrontage')
plt.show()

SalePrice tends to increase with LotFrontage.

In [23]:
sns.jointplot(x="YearBuilt",y="SalePrice",data=df)
Out[23]:
<seaborn.axisgrid.JointGrid at 0x1560141d208>

As can be observed above, the largest number of houses were built around the year 2000; the highest SalePrice is around 280,000, and houses in the range of roughly 140,000 to 200,000 account for most of the sales.

In [24]:
fig = plt.figure(figsize=(10,6)); sns.boxplot(x='OverallQual', y='SalePrice', data=df[['SalePrice', 'OverallQual']])
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x15601f06ba8>

It can be observed from the above box plots that as the overall quality of the houses increases, the mean sale price also increases.

Here we are plotting relationship between 3 variables.

In [25]:
fig = plt.figure(figsize=(10,6))
sns.scatterplot(x = df.GarageArea, y = df.SalePrice, hue=df.OverallQual, palette= 'Spectral')
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x15601f29a90>
In [27]:
fig = plt.figure(figsize=(10,6))
sns.scatterplot(y = df.SalePrice, x = df['1stFlrSF'], hue=df.OverallQual, palette= 'YlOrRd')
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x1560143c6d8>

Bathrooms Features

There are 4 bathroom variables in our dataset. Of these, FullBath has the strongest individual correlation with SalePrice.

In [28]:
fig = plt.figure(figsize=(20,10))
fig1 = fig.add_subplot(221); sns.regplot(x='FullBath', y='SalePrice', data=df)
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.FullBath.corr(df['SalePrice'])))
fig2 = fig.add_subplot(222); sns.regplot(x='HalfBath', y='SalePrice', data=df);
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.HalfBath.corr(df['SalePrice'])))
fig3 = fig.add_subplot(223); sns.regplot(x='BsmtFullBath', y='SalePrice',
data=df)
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.BsmtFullBath.corr(df['SalePrice'])))
fig4 = fig.add_subplot(224); sns.regplot(x='BsmtHalfBath', y='SalePrice',
data=df);plt.title('Correlation with SalePrice: {:6.4f}'.format(df.BsmtHalfBath.corr(df['SalePrice'])))
plt.show()

REGRESSION ANALYSIS

In [29]:
for col in df.columns: 
    print(col)
Id
MSSubClass
MSZoning
LotFrontage
LotArea
Street
LotShape
Utilities
OverallQual
OverallCond
YearBuilt
Heating
1stFlrSF
2ndFlrSF
BsmtFullBath
BsmtHalfBath
FullBath
HalfBath
BedroomAbvGr
KitchenAbvGr
GarageArea
SalePrice
In [30]:
results = df.dtypes        # check the data type of each column
print(results)
Id                int64
MSSubClass        int64
MSZoning         object
LotFrontage     float64
LotArea           int64
Street           object
LotShape         object
Utilities        object
OverallQual       int64
OverallCond       int64
YearBuilt         int64
Heating          object
1stFlrSF          int64
2ndFlrSF          int64
BsmtFullBath    float64
BsmtHalfBath    float64
FullBath          int64
HalfBath          int64
BedroomAbvGr      int64
KitchenAbvGr      int64
GarageArea      float64
SalePrice       float64
dtype: object
In [31]:
# List the continuous (numeric) columns; the remaining columns will be treated as categorical for regression:
continuous_columns = ['Id','MSSubClass','LotFrontage','LotArea','OverallQual','OverallCond','YearBuilt','1stFlrSF','2ndFlrSF','BsmtFullBath','BsmtHalfBath',
                    'FullBath','HalfBath','BedroomAbvGr','GarageArea','SalePrice','KitchenAbvGr']
In [32]:
categorical_columns = [col for col in df.columns if col not in continuous_columns]
print(categorical_columns)
['MSZoning', 'Street', 'LotShape', 'Utilities', 'Heating']
In [33]:
# Carrying out one-hot encoding in order to assign equal weights to categorical columns.
# Columns converted to categorical columns using as_ordered.
for col_name, col in df[categorical_columns].items():
    df[col_name] = col.astype('category').cat.as_ordered()
print(df.dtypes)
Id                 int64
MSSubClass         int64
MSZoning        category
LotFrontage      float64
LotArea            int64
Street          category
LotShape        category
Utilities       category
OverallQual        int64
OverallCond        int64
YearBuilt          int64
Heating         category
1stFlrSF           int64
2ndFlrSF           int64
BsmtFullBath     float64
BsmtHalfBath     float64
FullBath           int64
HalfBath           int64
BedroomAbvGr       int64
KitchenAbvGr       int64
GarageArea       float64
SalePrice        float64
dtype: object
In [34]:
#Using dummy variables in order to replace categorical columns
housing_data = df.copy()
housing_train_data = pd.get_dummies(housing_data, columns=['MSZoning','LotShape','Utilities','Heating','Street'], prefix = ['MSZoning','LotShape','Utilities','Heating','Street'])

print(housing_train_data.dtypes)
Id                    int64
MSSubClass            int64
LotFrontage         float64
LotArea               int64
OverallQual           int64
OverallCond           int64
YearBuilt             int64
1stFlrSF              int64
2ndFlrSF              int64
BsmtFullBath        float64
BsmtHalfBath        float64
FullBath              int64
HalfBath              int64
BedroomAbvGr          int64
KitchenAbvGr          int64
GarageArea          float64
SalePrice           float64
MSZoning_C (all)      uint8
MSZoning_FV           uint8
MSZoning_RH           uint8
MSZoning_RL           uint8
MSZoning_RM           uint8
LotShape_IR1          uint8
LotShape_IR2          uint8
LotShape_IR3          uint8
LotShape_Reg          uint8
Utilities_AllPub      uint8
Heating_GasA          uint8
Heating_GasW          uint8
Heating_Grav          uint8
Heating_Wall          uint8
Street_Grvl           uint8
Street_Pave           uint8
dtype: object
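One caveat when feeding these dummy columns into a linear regression: encoding every level of a category produces perfectly collinear columns (the "dummy variable trap"). Passing `drop_first=True` to `get_dummies` is a common remedy; a minimal sketch on toy data:

```python
import pandas as pd

# Toy frame with a single categorical column (hypothetical values).
toy = pd.DataFrame({"MSZoning": ["RL", "RM", "RL", "FV"]})

# Full encoding: one indicator column per category level (3 levels here).
full = pd.get_dummies(toy, columns=["MSZoning"])

# drop_first=True removes one redundant level per feature, avoiding
# perfect collinearity among the dummies in a linear model.
reduced = pd.get_dummies(toy, columns=["MSZoning"], drop_first=True)
print(full.shape[1], reduced.shape[1])
```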
In [35]:
print(housing_train_data)
        Id  MSSubClass  LotFrontage  LotArea  OverallQual  OverallCond  \
0     1461          20         80.0    11622            5            6   
1     1462          20         81.0    14267            6            6   
2     1463          60         74.0    13830            5            5   
3     1464          60         78.0     9978            6            6   
4     1465         120         43.0     5005            8            5   
5     1466          60         75.0    10000            6            5   
7     1468          60         63.0     8402            6            5   
8     1469          20         85.0    10176            7            5   
9     1470          20         70.0     8400            4            5   
10    1471         120         26.0     5858            7            5   
11    1472         160         21.0     1680            6            5   
12    1473         160         21.0     1680            5            5   
13    1474         160         24.0     2280            6            6   
...          ...         ...          ...      ...          ...          ...

[output abridged: remaining rows, the YearBuilt / 1stFlrSF / 2ndFlrSF / BsmtFullBath panel, and the dummy-variable panels (LotShape_IR2/IR3/Reg, Utilities_AllPub, Heating_GasA/GasW/Grav/Wall, Street_Grvl/Pave) elided]
[1226 rows x 33 columns]
In [36]:
# Plot the correlation matrix and inspect the heatmap
corrmatrix = housing_train_data.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmatrix, vmax=.8, square=True);
In [37]:
# Correlation heatmap annotated with the correlation values
plt.figure(figsize=(100,80))
cor = housing_train_data.corr()
sns.set(font_scale=4)
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds,annot_kws={"size": 45},vmax=.8, square=True)
plt.show()
In [38]:
# Take 0.2 as the threshold; columns whose absolute correlation with
# SalePrice exceeds it are selected for regression
threshold = 0.2
filtered_data = abs(cor['SalePrice'])
result = filtered_data[filtered_data > threshold]
print(result)
LotFrontage     0.499447
LotArea         0.686926
1stFlrSF        0.319452
2ndFlrSF        0.344355
FullBath        0.314649
HalfBath        0.206062
BedroomAbvGr    0.813514
GarageArea      0.245892
SalePrice       1.000000
MSZoning_RL     0.278746
MSZoning_RM     0.217509
Name: SalePrice, dtype: float64
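Rather than retyping the surviving column names in the next cell, the selected features could also be derived programmatically from the filtered correlation series. A minimal sketch on a toy stand-in series (the values below are hypothetical, not the notebook's actual correlations):

```python
import pandas as pd

# Toy stand-in for cor['SalePrice'] (hypothetical values)
cor_target = pd.Series(
    {"LotArea": 0.69, "GarageArea": 0.25, "YrSold": 0.05, "SalePrice": 1.0}
)

threshold = 0.2
# Keep columns whose absolute correlation with SalePrice exceeds the threshold
selected = cor_target[cor_target.abs() > threshold].index.tolist()
print(selected)  # → ['LotArea', 'GarageArea', 'SalePrice']
```

This keeps the feature list in sync automatically if the threshold is changed later.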
In [39]:
# Form the final training DataFrame for regression (note: MSSubClass is
# retained here even though it did not clear the 0.2 correlation threshold)
final_train_data = housing_train_data[['MSSubClass','LotFrontage','LotArea','1stFlrSF','2ndFlrSF','FullBath','HalfBath',
                    'BedroomAbvGr','GarageArea','SalePrice','MSZoning_RL','MSZoning_RM']]
print(final_train_data)
      MSSubClass  LotFrontage  LotArea  1stFlrSF  2ndFlrSF  FullBath  \
0             20         80.0    11622       896         0         1   
1             20         81.0    14267      1329         0         1   
2             60         74.0    13830       928       701         2   
...          ...          ...      ...       ...       ...       ...   

      HalfBath  BedroomAbvGr  GarageArea    SalePrice  MSZoning_RL  MSZoning_RM
0            0             2       730.0  169277.0525            0            0
1            1             3       312.0  187758.3940            1            0
2            1             3       482.0  183583.6836            1            0
...        ...           ...         ...          ...          ...          ...
[1226 rows x 12 columns]
In [40]:
# Feature matrix X and target vector Y for regression:
X = final_train_data[['MSSubClass','LotFrontage','LotArea','1stFlrSF','2ndFlrSF','FullBath','HalfBath',
                    'BedroomAbvGr','GarageArea','MSZoning_RL','MSZoning_RM']]
Y = final_train_data[['SalePrice']]

Splitting the data into training and testing sets

In [41]:
#Splitting the data into Train and Test
from sklearn.model_selection import train_test_split, KFold, cross_val_score
xtrain, xtest, ytrain, ytest = train_test_split(X,Y,test_size=1/3, random_state=0)
# one third of the data is held out as the test set
print("xtrain : " + str(xtrain.shape))
print("xtest : " + str(xtest.shape))
print("ytrain : " + str(ytrain.shape))
print("ytest : " + str(ytest.shape))
print(xtrain)
print(ytrain)
xtrain : (817, 11)
xtest : (409, 11)
ytrain : (817, 1)
ytest : (409, 1)
      MSSubClass  LotFrontage  LotArea  1stFlrSF  2ndFlrSF  FullBath  \
858           60         64.0     7713       728       728         2   
61            20         80.0    10032      1580         0         1   
796           60         60.0     7500       957      1342         3   
...          ...          ...      ...       ...       ...       ...   

      HalfBath  BedroomAbvGr  GarageArea  MSZoning_RL  MSZoning_RM  
858          1             3       400.0            1            0  
61           1             3       440.0            1            0  
796          1             5       482.0            1            0  
...        ...           ...         ...          ...          ...  

[817 rows x 11 columns]
219          0             2       649.0            1            0  
10           0             2       511.0            0            0  
...        ...           ...         ...          ...          ...  
1400         0             2       360.0            1            0  
795          1             3       420.0            1            0  
950          0             2       308.0            1            0  
1432         0             4         0.0            0            0  
112          0             1       572.0            1            0  
1009         0             5       658.0            0            1  
533          1             4       400.0            1            0  
900          1             3       686.0            1            0  
1169         1             4       675.0            1            0  
131          0             2       260.0            1            0  
925          0             3       311.0            1            0  
80           1             2       624.0            0            1  
1003         1             4       408.0            0            1  
635          0             2       200.0            0            1  
1008         0             2       216.0            0            1  
203          1             2       762.0            1            0  
96           0             3         0.0            1            0  
649          0             2       288.0            1            0  
846          0             2       460.0            1            0  
368          0             2      1184.0            0            1  
722          0             3       610.0            1            0  
1312         0             5       624.0            0            1  
721          0             3       814.0            1            0  
329          1             1       320.0            1            0  
1236         1             3       688.0            1            0  
909          1             3       440.0            0            0  
992          0             2       331.0            0            1  
1449         0             1         0.0            0            1  
659          0             3       355.0            0            1  
824          0             2       319.0            1            0  

[817 rows x 11 columns]
        SalePrice
858   178657.7941
61    179423.7516
796   205895.2546
1339  171022.2414
1320  162186.0712
933   179374.4820
1329  168385.5711
822   162293.3761
608   157054.2539
802   201016.1711
554   195664.6861
817   184299.8381
588   228617.9683
844   145708.7643
330   221648.3920
981   156407.6592
434   169355.2025
869   180258.1312
331   180539.8808
523   157437.1928
835   195158.9726
946   182503.5201
918   191393.6086
1404  161647.3298
1028  189179.4282
241   185922.9668
265   181262.3300
88    157580.1776
219   160741.0986
10    157933.2795
...           ...
1400  188374.9367
795   180081.5088
950   166115.4912
1432  198678.8941
112   162015.3185
1009  212693.5030
533   198168.4901
900   177086.4764
1169  201842.9987
131   154850.5084
925   179208.6657
80    167186.9958
1003  189806.5466
635   164810.6095
1008  163590.6772
203   170935.8592
96    174744.4003
649   163299.4780
846   161474.5349
368   166505.9840
722   180422.0500
1312  209219.8448
721   178818.7426
329   160805.5846
1236  180718.5817
909   170175.2268
992   164968.9479
1449  137402.5699
659   170129.5314
824   151422.1169

[817 rows x 1 columns]
In [42]:
xtrain = xtrain.fillna(0)   # replace remaining NaNs in the training features with 0
In [43]:
np.isnan(xtrain)   # confirm that no NaN values remain (every entry should be False)
Out[43]:
MSSubClass LotFrontage LotArea 1stFlrSF 2ndFlrSF FullBath HalfBath BedroomAbvGr GarageArea MSZoning_RL MSZoning_RM
858 False False False False False False False False False False False
61 False False False False False False False False False False False
796 False False False False False False False False False False False
1339 False False False False False False False False False False False
1320 False False False False False False False False False False False
933 False False False False False False False False False False False
...     ...   ...     ...      ...      ...      ...      ...          ...        ...          ...          ...
824 False False False False False False False False False False False

817 rows × 11 columns

In [44]:
xtest = xtest.fillna(0)
ytest = ytest.fillna(0)
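Filling every NaN with 0 is quick, but for a feature like LotFrontage it invents houses with zero street frontage and drags the column toward 0. A minimal sketch of mean imputation as an alternative, using made-up numbers (not rows from our dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with a missing LotFrontage value (column names mirror the
# dataset, but the numbers are illustrative).
df = pd.DataFrame({"LotFrontage": [72.0, np.nan, 44.0],
                   "LotArea": [9770, 9000, 17485]})

# Zero-filling (as above) treats a missing frontage as "no frontage":
zero_filled = df.fillna(0)

# Mean imputation keeps the column's scale intact instead:
mean_filled = df.fillna(df.mean())

print(zero_filled["LotFrontage"].tolist())  # [72.0, 0.0, 44.0]
print(mean_filled["LotFrontage"].tolist())  # [72.0, 58.0, 44.0]
```

On the real data this would be `xtrain.fillna(xtrain.mean())`, with the means computed on the training set only so that no information leaks in from the test set.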

Performing multiple linear regression using the least squares method

In [45]:
import sklearn
from sklearn import linear_model
model_reg = linear_model.LinearRegression()    # form a linear regression model
In [46]:
ytrain = ytrain.fillna(0)
In [47]:
model_reg.fit(xtrain,ytrain)  # fit the regression model to the training data
Out[47]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [48]:
model_reg.score(xtrain,ytrain)  # R^2 score of the model on the training data
Out[48]:
0.9512519845964049
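This 0.95 is the R² on the training data itself, which can be optimistic. A hedged sketch, on synthetic data rather than our dataset, of using cross-validation for a fairer estimate of how the model generalizes:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (xtrain, ytrain): a linear signal plus noise.
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X @ np.array([3.0, -2.0, 1.5]) + 0.05 * rng.randn(200)

model = LinearRegression()
model.fit(X, y)
print(model.score(X, y))          # training R^2 (tends to be optimistic)

# 5-fold cross-validated R^2: each fold is scored on data it was not fit on.
cv_scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(cv_scores.mean())
```

If the cross-validated score sits well below the training score, the model is overfitting; a small gap (as in this sketch) suggests the fit generalizes.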
In [49]:
model_reg.coef_   # fitted coefficient for each feature column
Out[49]:
array([[ 2.62367221e+00, -4.61232867e+00,  2.00775613e+00,
        -2.31956830e-01, -1.79324640e-01, -1.13598669e+02,
        -1.94142292e+02,  1.34823355e+04,  4.34307634e-01,
         1.29141441e+02, -2.39764954e+02]])
In [50]:
model_reg.intercept_   # fitted intercept term
Out[50]:
array([121507.58127551])
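The raw `coef_` array is hard to read because the positions are unlabelled. A small sketch of pairing each coefficient with its feature name, using a toy frame with real column names but made-up numbers:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Tiny illustrative frame (hypothetical values, real column names).
X = pd.DataFrame({"LotArea":      [9770, 9000, 17485, 13200, 12460],
                  "FullBath":     [1, 2, 1, 2, 2],
                  "BedroomAbvGr": [3, 4, 2, 5, 4]})
y = pd.Series([166115, 198678, 162015, 212693, 198168])

reg = LinearRegression().fit(X, y)

# A labelled Series makes the coefficient array readable at a glance.
coefs = pd.Series(reg.coef_, index=X.columns)
print(coefs.sort_values(ascending=False))
```

On our model this would be `pd.Series(model_reg.coef_.ravel(), index=xtrain.columns)`, which makes it easy to see which attributes carry the largest weights.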
In [51]:
predictions = model_reg.predict(xtest)   # predict sale prices for the test set
In [55]:
predictionsdf = pd.DataFrame(predictions)
predictionsdf.to_excel("C:/Users/Admin/Desktop/output1.xlsx")
In [56]:
ytest.to_excel("C:/Users/Admin/Desktop/output2.xlsx")   # write the actual values to a separate file so the predictions are not overwritten
In [57]:
from sklearn import metrics
print("Evaluation of the prediction model : Mean Absolute Error")
print(metrics.mean_absolute_error(ytest,predictionsdf))
print("Mean Squared Error")
print(metrics.mean_squared_error(ytest,predictionsdf))
Evaluation of the prediction model : Mean Absolute Error
2915.135547907341
Mean Squared Error
12587212.795744263
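The MAE is in dollars, but the MSE is in squared dollars, which is why it looks so large. Taking the square root (RMSE) puts the error back on the SalePrice scale: √12,587,212.8 ≈ 3,548. A small sketch with made-up prices:

```python
import numpy as np
from sklearn import metrics

# Illustrative actual vs. predicted sale prices (made-up values).
ytrue = np.array([178657.79, 179423.75, 205895.25, 171022.24])
ypred = np.array([180000.00, 178500.00, 200000.00, 173000.00])

mae = metrics.mean_absolute_error(ytrue, ypred)
mse = metrics.mean_squared_error(ytrue, ypred)
rmse = np.sqrt(mse)   # back on the dollar scale of SalePrice
print(mae, rmse)
```

RMSE is always at least as large as MAE and penalizes big misses more heavily, so reporting both gives a fuller picture of the error.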

Performing Ridge regression

In [59]:
from sklearn.linear_model import Ridge
ridgemodel = Ridge(alpha=0)   # alpha=0 means no regularization penalty
ridgemodel.fit(xtrain, ytrain)
Out[59]:
Ridge(alpha=0, copy_X=True, fit_intercept=True, max_iter=None, normalize=False,
      random_state=None, solver='auto', tol=0.001)
In [61]:
ridge_score = ridgemodel.score(xtrain,ytrain) # Calculating the score of the model
print(ridge_score)
0.9512519845964049

The ridge score equals the least squares R² score. Note that with alpha=0 the penalty term vanishes, so ridge regression reduces to ordinary least squares and identical scores are expected; to properly probe for multicollinearity, one would increase alpha and check whether the coefficients change sharply.
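One way to check for multicollinearity is to sweep alpha and watch the coefficient vector: with collinear features, the OLS (alpha=0) coefficients are unstable and shrink noticeably as alpha grows. A sketch on synthetic collinear data, not our dataset:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data with two nearly identical (collinear) features.
rng = np.random.RandomState(0)
x1 = rng.rand(100)
x2 = x1 + 0.01 * rng.randn(100)          # near-duplicate of x1
X = np.column_stack([x1, x2])
y = 2 * x1 + 3 * x2 + 0.1 * rng.randn(100)

# Increasing alpha shrinks the coefficient vector toward zero,
# stabilizing it when features are collinear.
norms = {}
for alpha in [0.0, 0.1, 1.0]:
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    norms[alpha] = np.linalg.norm(coef)
    print(alpha, np.round(coef, 2))
```

If the coefficients barely move as alpha increases, multicollinearity is unlikely to be distorting the least squares fit.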

Plot of actual_values vs predicted_values

In [63]:
# Scatter plot of actual vs. predicted sale prices
plt.figure(figsize=(40,30))
plt.scatter(ytest, predictions)
plt.xlabel('Actual_values')
plt.ylabel('Predicted_values')
Out[63]:
Text(0, 0.5, 'Predicted_values')

The points in the actual vs. predicted plot lie close to a straight line, indicating that the model fits the data well.

In [65]:
# Plot the residuals to evaluate the performance of the model
y_test_df = pd.DataFrame(ytest)
y_test_df.columns = ['SalePrice']
predictionsdf.columns = ['SalePrice']
true_val = y_test_df['SalePrice'].values.copy()
pred_val = predictionsdf['SalePrice'].values.copy()
residual = true_val - pred_val
fig, ax = plt.subplots(figsize=(40,20))
plt.xlabel('Predicted value for Sale price')
plt.ylabel('Residual')
_ = ax.scatter(pred_val,residual)

The plot of predicted sale price vs. residuals shows randomly scattered points with no apparent pattern; a patternless residual plot indicates that the model is a good fit for the data.

CONCLUSION

We can hereby conclude that the number of bedrooms above ground level (BedroomAbvGr) is the attribute most strongly associated with SalePrice and affects the house value the most. Other attributes that substantially affect the sale price are FullBath, LotFrontage and LotArea. These factors should be taken into consideration while buying a house, as they largely drive the sale price. Multiple linear regression fits our dataset well, with an R² value of about 0.95, which is a good number.